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Abstract 

Background: Zipf's law and Heaps' law are two representatives of the scaling concepts, which play a 
significant role in the study of complexity science. The coexistence of the Zipf's law and the Heaps' law 
motivates different understandings on the dependence between these two scalings, which is still hardly 
been clarified. 

Methodology/Principal Findings: In this article, we observe an evolution process of the scalings: 
the Zipf's law and the Heaps' law are naturally shaped to coexist at the initial time, while the crossover 
comes with the emergence of their inconsistency at the larger time before reaching a stable state, where 
the Heaps' law still exists with the disappearance of strict Zipf's law. Such findings arc illustrated with a 
scenario of large-scale spatial epidemic spreading, and the empirical results of pandemic disease support 
a universal analysis of the relation between the two laws regardless of the biological details of disease. 
Employing the United Statcs(U.S.) domestic air transportation and demographic data to construct a 
metapopulation model for simulating the pandemic spread at the U.S. country level, we uncover that the 
broad heterogeneity of the infrastructure plays a key role in the evolution of scaling emergence. 

Conclusions/Significance: The analyses of large-scale spatial epidemic spreading help understand 
the temporal evolution of scalings, indicating the coexistence of the Zipf's law and the Heaps' law depends 
on the collective dynamics of epidemic processes, and the heterogeneity of epidemic spread indicates the 
significance of performing targeted containment strategies at the early time of a pandemic disease. 

Introduction 

Scaling concepts play a significant role in the field of complexity science, where a considerable amount 
of efforts is devoted to understand these universal properties underlying multifarious systems [1-4]. Two 
representatives of scaling emergence are the Zipf's law and the Heaps' law. G.K. Zipf, sixty years ago, 
found a power law distribution for the occurrence frequencies of words within different written texts, 
when they were plotted in a descending order against their rank [5]. This frequency-rank relation also 
corresponds to a power law probability distribution of the word frequencies [32]. The Zipf's law is found 
to hold empirically for a great deal of complex systems, e.g., natural and artificial languages [5-9], city 
sizes[10,ll], firm sizes[12], stock market index[13,14], gene expression[15,16], chess opening[17], arts[18], 
paper citations [19], family names [20], and personal donations [21]. Many mechanisms are proposed to 
trace the origin of the Zipf's law [22-24]. 

Heaps' law is another important empirical principle describing the sublinear growth of the number of 
unique elements, when the system size keeps on enlarging[25]. Recently, particular attention is paid to the 
coexistence of the Zipf's law and the Heaps' law, which is reported for the corpus of web texts[26], key- 
words in scientific publication[27], collaborative tagging in web applications[28,29], chemoinformatics[30], 
and more close to the interest in this article, global pandemic spread[31], and etc. 

In [33,34], an improved version of the classical Simon model[35] was put forward to investigate the 
emergence of the Zipf's law, which is deemed to be a result from the existence of the Heaps' law. However, 
[26,32] concluded that the Zipf's law leads to the Heaps' law. In fact, the interdependence of these two laws 
has hardly been clarified. This embarrassment comes from the fact that the empirical/simulated evidence 
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employed to show the emergence of Zipf's law mainly deals with static and finalized spcicmens/results, 
while the Heaps' law actually describes the evolving characteristics. 

In this article, we investigate the relation between these scaling laws from the perspective of cocvolu- 
tion between the scaling properties and the epidemic spread. We take the scenarios of large-scale spatial 
epidemic spreading for example, since the empirical data contain sufficient spatiotcmporal information 
making it possible to visualize the evolution of the scalings, which allows us to analyze the inherent 
mechanisms of their formation. The Zipf's law and the Heaps' law of the laboratory confirmed cases 
are naturally shaped to coexist during the early epidemic spread at both the global and the U.S. levels, 
while the crossover comes with the emergence of their inconsistency as the epidemic keeps on prevailing, 
where the Heaps' law still exists with the disappearance of strict Zipf's law. With the U.S. domestic air 
transportation and demographic data, we construct a fine-grained metapopulation model to explore the 
relation between the two scalings, and recognize that the broad heterogeneity of the infrastructure plays 
a key role in their temporal evolution, regardless of the biological details of diseases. 

Results 

Empirical and Analytical Results 

With the empirical data of the laboratory confirmed cases of the A(H1N1) provided by the World Health 
Organization(WHO)(see the data description in Materials and Methods), we first study the probability- 
rank distribution(_Pi?D) of the cumulative confirmed number( CCiV) of every infected country at several 
given dates sampled about every two weeks. Cj(t) denotes the CCN in a given country j at time t. 
Since Cj(t) grows with time, the distributions at different dates are normalized by the global CCN, 
Cxit) = Y^jCj(t), for comparison. Fig. 1(A) shows the Zipf-plots of the PRD Pt(r) of the infected 
countries' confirmed cases by arranging every Cj(t)/CT(t) > in a descending order for each specimen. 
The maximal rank r tjTOQX (on x-axis) for each specimen denotes the total number of infected countries at 
a given date, and grows as the epidemic spreading. 

At the early stage(the period between April 30th and June 1st, 2009), Pt(r) shows a power law pattern 
P t {r) ~ r~ e , which indicates the emergence of the Zipf's law. We estimate the power law exponent 6 for 
each specimen of this stage by the maximum likelihood method [22, 37], and report its temporal evolution 
in the left part of Fig. 1(C). About sixty countries were affected by the A(H1N1) on June 1st, and most 
of them are countries with large population and/or economic power, e.g., U.S., Mexico, Canada, Japan, 
Australia, China. After June 1st, the disease swept much more countries in a short time, and the WHO 
announcement on June llth[38] raised the pandemic level to its highest phase, phase 6(see Text SI), 
which implied that the global pandemic flu was occurring. At this stage(after June 1st, 2009), Pt(r) 
gradually displays a power law distribution with an exponential cutoff Pt(r) ~ r~ e exp(—r/r c ), where r c 
is the parameter controlling the cutoff effect (sec Text SI), and the exponent 9 gradually reduces to around 
1.7, as shown in Fig. 1(C). Surprisingly, Pt(r) at different dates eventually reaches a stable distribution as 
time evolves(see those curves since June in Fig. 1(A)). Indeed, after June 19th, 9 seems to reach a stable 
value with mild fluctuations, as shown in Fig. 1(C). The characteristics of the temporal evolution of the 
parameter r c is similar to 9, thus we mainly present the empirical results of the exponent 9 in the main 
text and hold the results of r c in Figure SI. In the following, we analyze the evolution of the normalized 
distribution P t (r) by the contact process of an epidemic transmission, regardless of the biological details 
of diseases. 

Straightforwardly, according to the mass action principle in the mathematical epidemiology [39, 40] (see 
Text SI), which is widely applied in studying the epidemic spreading process on a network[41-56], we 
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consider the SIR epidemic scheme here, 

(V [S] V W v inu . J (^f - hVf + l,vf), with rate pvfvf/N^ 

[ j ' j ' j ' \ (vf\l>?-l,vf + 1), with rate „l>f\ [ ) 

where T>^ denotes the number of individuals in compartment [</j](susceptible(S), infectious(I) or perma- 
nently recovered(R)) in a given country j, j3 denotes the disease transmission rate, and infectious individ- 
uals recover with a probability /i. The population in a given country j at time t is Nj(t) = 5^ 
where t = means the time when initially confirmed cases in the entire system are reported. At the 
early stage of a pandemic outbreak, the new introductions of infectious individuals dominate the onset 
of outbreak in unaffected countries. However, after the disease already lands in these countries, the 
ongoing indigenous transmission gradually exceeds the influence of the new introductions, and becomes 
the mainstream of disseminators [57, 58]. According to Eq.(l), in a given infected country j, there are 

Vf™\t + 1) = pvf^vf^/N^t) (2) 

new infected individuals on average at t + 1 days, and the average number of illness at t + 1 days is 

Vf (t + 1) = (1 - M + pvf {t)INj {t))vf (t). (3) 

Defining xO) = -» + pv\ S] {t) / Nj(t) and y(t) = V [ f ] (t) / ! Nj{t), we have 

t t-i 
vf{t + 1) = IJ l 1 + X(t'))vf ] (h) and T>f™\t + 1) = #)>(*) [] I 1 + xtflpf (h), (4) 
t'=ti t'=ti 

where D^(ti) denotes the number of initially confirmed or introduced cases in country j, and is always 

a small positive integer. The CCN of country j at t + 1 days is Cj(t+1) = Cj (t) + pj 7 """ 1 (t + 1). When 
i is large enough, we have 

t-i 

c 3 {t + i)/c 3 (t) = i + py(t) n I 1 + (*i)/Ci(*). (5) 

t'=tl 

Before the disease dies out in country j, Cj(i) keeps increasing from the onset of outbreak[59]. When t 

t-i 

is large enough, it is obviously Cj(t) » 0, < y < 1, -fj, < Af(i') <C /?-/J, thus J] [l+x(f')l is definitely 

t'=ti 

larger than and can hardly be infinity. Tjj\t\) is a small positive integer, thus T>j(ti)/Cj(t) ~ when 
£ is large enough. We therefore have Cj(t + 1)/Cj(i) ~ l,j ^ M(t + 1) for large i, where M(t + 1) is 
the total number of infected countries after t+1 days of spreading. Thus the normalized probability 
P t+1 (r(j)) at t+ 1 day is: 

fwW)) = ^TTTTT = V^TTT = P *Wi)),J < M(t+ 1), with large t, (6) 

where r(j) is the rank of the CCiV of country j in the descending order of the CCN list of all infected 
countries. Eq.(6) indicates that each probability P t (r(j)) is invariant for large t, thus the normalized 
distribution Pt(r) becomes stable when t is large enough. The intrinsic reasons for the emergence of 
these scaling properties are discussed in Modeling and Simulation Results. 

Since the normalized PRD Pt(r) displays the Zipf's law pattern Pt(r) ~ r~ e at the early stage of 
the epidemic, the CCN of the country ranked r is C r (t) ~ Cr(t) • r~ e at this stage. Considering the 
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CCN of the countries with ranks between r and r + Sr. where 5r is any infinitesimal value, we have 
5C r (t) ~ —9r~ e ~ 1 CT(t)5r. Supposing Sr ~ P Cr (t)SC r (t) with Pq t denoting the probability density 
function, we have 



Thus 



PcM ~ -e-^c^it). (7) 



P Cr {t)=A(l-<P)C*- 1 {t)C^{t), (8) 



where <j) = 1 + 6 \ .4 is a constant. According to the normalization condition j^ min ^ Pc r (t)dC r (t) = 1, 
where C max (t)(C mm (t)) is the CCN of the country with the maximal(minimal) value at a give time t, 
we have A = -C^r <p (t)C mm {t)' t '- 1 because (j) = 1 + 9~ x > 1 and C max (t) > 0. Then 

P Cr (t) = (</,- ljC™^- 1 *?-^*). (9) 

At a given date, r can be regarded as the number of countries with the amount of cumulated confirmed 
cases which is no less than C r (t), then 



r = 



c™ M (t) 

M(t)p(c;(t))dc;(t). (io) 

cv(t) 



Recalling r ~ (Cr(t) /C r (t))v , we have 

where n = 1/9. At the early stage corresponding to the period between April 30th and June 1st, C mm (t) 
is one according to the WHO data. Therefore, we have 

M(t)~C%(t),Ti = l/6, (12) 

which indicates that the Heap's law[25,26,31,32] can be observed in this case. The empirical evidence 
for the emergence of the Heap's law at this stage is shown in the middle part of Fig. 1(E). The Heaps' 
exponent r\ is obtained by the least square method[31,32], and the relevance between 9 and r\ is reported 
in Table 1. 

At the latter stage(the period after June 1st, 2009), the exponential tail of the distribution Pt(r) leads 
to a deviation from the strict Zipf's law. However, with a steeper exponent r\ w 0.473, the Heaps' law 
still exists, as shown in the right part of Fig. 1(E). Though the two scaling laws are naturally shaped to 
coexist during the early epidemic spreading, their inconsistency gradually emerges as the epidemic keeps 
on prevailing. Indeed, in the Discussion of [32], without empirical or analytical evidence, Lii et al have 
intuitively suspected that there may exist some unknown mechanisms only producing the Heaps' law, 
and it is possible that a system displaying the Heaps' law does not obey the strict Zipf's law. Here we 
not only verify this suspicion with the empirical results, but also explore the substaintial mechanisms of 
the evolution process in Modeling and Simulation Results, where we uncover the important role of the 
broad heterogeneity of the infrastructure in the temporal evolution of scaling emergence. 

We also empirically study the evolution of scaling emergence of the epidemic spreading at the coun- 
trywide level. Since the United States is one of the several earliest and most seriously prevailed countries 
of the A(H1N1)[60], we mainly focus on the A(H1N1) spreading in the United States. With the empirical 
data of the laboratory confirmed cases of the A(H1N1) provided by the Centers for Disease Control and 
Prevention(CDC)(see the data description in Materials and Methods), in Fig. 1(B) we report the PRD of 
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the CCN of infected states, P t us (r), at several given dates sampled about every two weeks. Our findings 
suggest a crossover in the temporal evolution of P t MS (r). At the early stagc(the period before May 15th), 
P" s (r) shows a power law pattern P" s (r) ~ r~ 6us with a much smaller exponent 8 US than that of the 
WHO results. Washington D.C. and 46 states (excluding Alaska, Mississippi, West Virginia, Wyoming) 
were affected by A(H1N1) on May 15th. After May 15th, P t us (r) gradually becomes a power law distri- 
bution with an exponential cutoff, P" s (r) ~ r~ 9us exp(— r/r^ s ), which leads to a deviation from the strict 
Zipf 's law. In this case, the exponent 8 US gradually reduces and reaches a stable value 0.45(see Fig. 1(D)), 
which conforms to the fact that P" s (r) of different dates eventually reaches a stable distribution as time 
evolves. The temporal evolution of the exponent 9 US of all data are shown in Figure S2. r c keeps the 
value around 14 after June 12th, 2009. 

The relation between M us (t) and C^ s (t) is shown in Fig. 1(F). Though at first glance this figure 
provides us an impression of the sublinear growth of the number of infected states M us (t) when the 
cumulative number of national total patients C^ s (t) increases, we could not use the least square method 
here to estimate the Heaps' exponent rj us for several reasons: (i) the amount of data at each stage is quite 
small; (ii) there are several periods that M us (t) keeps unchanged(May 6th — > May 7th, M us (t) = 41; 
May 12th -> May 13th, M us {t) = 45; May 18th -> May 27th, M us (t) = 48); (hi) the magnitude of C% s {t) 
is much larger than that of M us (t); (iv) after June 1st, 2009, Washington D.C. and all 50 states of the 
United States were affected by the A(H1N1). Define M max the maximal number of the geographical 
regions the epidemic spreads to. In the U.S. scenario, A/™ ax = 51. When M us (t) reaches M™ ax on June 
1st, P" s (r) evolves and becomes stable after June 26th(see Fig.l(B,D)). In the Modeling and Simulation 
Results, we explore the relation between these two scalings with a fine grained metapopulation model 
characterizing the spread of the A(H1N1) at the U.S. level in detail. 

Note that these scaling properties are not exceptive for the A(H1N1) transmission. More supported 
exemplifications are reported in Figure S3, e.g. the cases of SARS, Avian Influenza(H5Nl). It is worth 
remarking that the normalized distribution Pt(r) almost keeps the power law pattern during the whole 
spreading process of the global SARS. This phenomenon might result from the intense containment 
strategies, e.g. patient isolation, enforced quarantine, school closing, travel restriction, implemented by 
individuals or governments confronting mortal plague. 

Modeling and Simulation Results 

The above analyses, however, do not tell the whole story, because the intrinsic reasons for the emergence of 
these scaling properties have not been explained. Some additional clues from the perspective of Shannon 
entropy [61] of a system might unlock the puzzle. 

Nowadays, population explosion in the urban areas, massive interconncctivity among different geo- 
graphical regions, and huge volume of human mobility are the factors accelerating the spread of infec- 
tious disease[62,74]. At a large geographical scale, one main class of models is the metapopulation model 
dividing the entire system into several interconnected subpopulations[58,63-74,87,88]. Within each sub- 
population, the infectious dynamics is described by the compartment schemes, while the spread from 
one subpopulation to another is due to the transportation and mobility infrastructures, e.g., air trans- 
portation. Individuals in each subpopulation exist in various discrete health compartments (status), i.e. 
susceptible, latent, infectious, recovered, and etc., with compartmental transitions by the contagion pro- 
cess or spontaneous transition, and might travel to other subpopulations by vehicles, e.g., airplane, in 
a short time. The metapopulation model can not only be employed to describe the global pandemic 
spread when we regard each subpopulation as a given country, but also be used to simulate the disease 
transmission within a country when each subpopulation is regarded as a given geographical region in the 
country. Here we mainly consider the spread of pandemic influenza at the U.S. country level for threefold 
reasons: (i) the computational cost of simulating global pandemic spread is too tremendous to implement 
on a single PC or Server[58,70,72,81,87]; (ii) the IATA or OAG flight schedule data, which is widely 
used to obtain the global air transportation network, do not provide the attendance and flight-connecting 
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information(see data description in Materials and Methods); (iii) the United States is one of the several 
earliest and most seriously prevailed countries [60]. 

We construct a metapopulation model at the U.S. level with the U.S. domestic air transportation 
and demographic statistical data[75-78] (detailed data description is provided in Materials and Methods, 
and a full specification of the simulation model is reported in Text SI). Define a subpopulation as a 
Mctropolitan/Micropolitan Statistical Areas(MSAs/^SAs)[75] connected by a transportation network, in 
this article, the U.S. domestic airline network(USDAN). The USDAN is a weighted graph comprising 
V = 406 vertices (airports) and E = 6660 weighted and directed edges denoting flight courses. The 
weight of each edge is the daily amount of passengers on that flight course. The infrastructure of the 
USDAN presents high levels of heterogeneity in connectivity patterns, traffic capacities and population(see 
Fig. 2). The disease dynamics in a single subpopulation is modeled with the Susceptible-Latent-Infcctious- 
Recovered(SLlR) compartmcntal scheme, where the abbreviation L denotes the latent compartment which 
experiences e^ 1 days on average for an infected pcrson(Thc SIR epidemic dynamics discussed at Empirical 
and Analytical Results is an reasonable approximation, which actually simplifies the epidemic evolution 
to a Markov chain to help us study the issue, and the value of the reproductive number Rq does not 
depend on e, we therefore ignore the compartment L there). 

The key parameters determining the spreading rate of infections are the reproductive number Rq and 
the generation time Gt- Ro is defined as the average amount of individuals an ill person infects during 
his or her infectious period in a large fully susceptible population, and G t refers to the sum of the 
latent period e _1 and the infectious period /it -1 . In our metapopulation model, Rq = j3 ■ (i . The initial 
conditions of the disease are defined as the onset of the outbreak in San Diego-Carlsbad-San Marcos, CA 
MSA on April 17th, 2009, as reported by the CDC[79]. Assuming a short latent period value = 1.1 
days as indicated by the early estimates of the pandemic A(H1N1)[80], which is compatible with other 
recent studies[81,82], we primarily consider a baseline case with parameters: Gt = 3.6, yT 1 = 2.5 days and 
Ro = 1.75, which are higher than those obtained in the early findings of the pandemic A(H1N1)[80], but 
they are the median results in other subsequent analyses [8 1,83]. Fixing the latency period to e _1 = 1.1 
days, we also employ a more aggravated baseline scenario with parameters: Gt =4.1, = 3 days and 
Ro = 2.3, which are close to the upper bound results in[81, 83-85]. 

In succession, we characterize the disease spreading pattern by information entropy, which is custom- 
arily applied in information theory. To quantify the heterogeneity of the epidemic spread at the U.S. 
level, we examine the prevalence at each time t, ij(t) = D^\t)/Nj(t), for all subpopulations, and intro- 
duce the normalized vector with components p^(t) = ij(t)/ ik(t). Then we measure the level of 
heterogeneity of the disease prevalence by quantifying the disorder encoded in with the normalized 
entropy function 

which provides an estimation of the geographical heterogeneity of the disease spread at time t. If the 
disease is uniformly influencing all subpopulations(e.g., all prevalences are equivalent), the entropy reaches 
its maximum value = 1. On the other hand, starting from H M = 0, which is the most localized and 
heterogeneous situation that just one subpopulation is initially affected by the disease, H W (t) increases 
as more subpopulations are influenced, thus decreasing the level of heterogeneity. 

In order to better uncover the origin of the emergence of the scaling properties, we compare the 
baseline results with those obtained on a null model UNI. The UNI model is a homogeneous Erdos-Rcnyi 
random network with the same number of vertices as that of the USDAN, and the generating regulation 
is described as follows: for each pair of vertices (i, j), an edge is independently generated with the uniform 
probability p e = (k)/V, where (k) = 16.40 is the average out-degree of the USDAN. Moreover, the weights 
of the edges and the populations are uniformly equal to their average values in the USDAN, respectively. 
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Therefore, the UNI model is completely absent from the heterogeneity of the airline topology, flux and 
population data. 

Different evolving behaviors between the UNI scenarios and the baselines (real airline cases) provide 
a remarkable evidence for the direct dependence between the scaling topropertics and the heterogeneous 
infrastructure. Fig.3(A,C) show the comparison of the PRD between the baseline results and the UNI 
outputs at several given dates sampled about every 30 days, where each specimen is the median result over 
all runs that led to an outbreak at the U.S. level in 100 random Monte Carlo realizations. In Fig. 3(A), 
we consider the situation of Rq = 1.75, and do observe that the evolution of PRD of the baseline case 
experiences two stages: a power law at the initial time and an exponentially cutoff power law at a larger 
time. However, the UNI scenario shows a distinct pattern: as time evolves, the middle part of the PRD 
grows more quickly, and displays a peak which obviously deviates scaling properties. Fig. 3(C) reports 
the situation of Rq = 2.3. In this aggravated instance, the PRD of the UNI scenario actually becomes 
rather homogeneous when t is large enough(see the curve of July 17th of the UNI scenario in Fig. 3(C)). 
Fig.3(B,D) present the comparison of the information entropy profiles between the baseline results and 
the UNI outputs when Rq = 1.75, Rq = 2.3, respectively. The completely homogeneous network UNI 
shows a homogeneous cvolution(ijM « 1) of the epidemic spread in a long period(see the light cyan areas 
in Fig.3(B,D)), with sharp fallings at both the beginning and the end of the outbreak. However, we 
observe distinct results in the baselines, where is significantly smaller than 1 for most of the time, 
and the long tails indicate a long lasting heterogeneity of the epidemic prevalence. These analyses signal 
that the broad heterogeneity of infrastructure plays an essential role in the emergence of scalings. 

We further explore the properties of the two scalings and their relation with the baseline case of 
Rq = 1.75 in detail. Since each independent simulation generates a stochastic realization of the spreading 
process, we analyze the statistical properties with 100 random Monte Carlo realizations, measure the 
normalized PRD of the CCN of infected MSAs/^uSAs for each realization that led to an outbreak at 
the U.S. level, and report the median result of the PRD P^ MS (r) of each day. From t = 26 to t = 39, 
P/ us (r) clearly shows a power law pattern Pl us (r) ~ r~ 6 "^, which implies the emergence of the Zipf's 
law(when t < 26, just several regions are affected by the disease). The exponent Q' us at each date is 
estimated by the maximum likelihood mcthod[22,37], and the temporal evolution of 0' us is reported in 
the left part of Fig. 4 (A). When t > 39, P[ us (r) gradually becomes an exponentially cutoff power law 
distribution P^ us (r) ~ r" e «=exp(— r/r" s ), and the exponent ff gradually reduces and reaches a stable 
value of 0.574 with neglectable fluctuations when t > 126(see Fig. 4(A)). Here we do not show the error 
bar since the fitting error on the exponent is far less(10 -2 ) than the value of 9' us by the average of 100 
random realizations. The inset of Fig. 4(A) shows the increase of the number of infected regions M' us (t) 
as time evolves. When t > 110, more than 400 subpopulations reports the existence of confirmed cases, 
thus M' us (t) tends to reach its saturation. 

Fig. 4(B) shows the relation between M' us {t) and C^ ls (t)(the national cumulative number of patients). 
Since P t /tis (r) displays a power law of P/ us (r) = b • r e « s at the early stage of the period between t = 26 
and t = 39, it is reasonable to deduce the existence of the Heaps' law 

MUt) = (Or us (t) ■ b)<- , n ' us = 1/C, (14) 

according to the analyses in Empirical and Analytical Results. In order to verify this assumption, we 
estimate the exponent r]' us using Eq.(14), and report the relevance between 0' us and n' us in Table 2 (the 
amount of data in this period is not sufficient to get a accurate estimation of the exponent rf us with the 
least square method). When t > 39, though P[ us (r) gradually deviates the strict Zipf's law, the Heaps' 
law of the relation between M' us {t) and C^ s (t) still exists till M' us (t) tends to reach its saturation(see 
the middle part in Fig. 4(B)). 
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Discussion 

Zipf's law and Heaps' law are two representatives of the scaling concepts in the study of complexity 
science. Recently, increasing evidence of the coexistence of the Zipf's law and the Heaps' law motivates 
different understandings on the dependence between these two scalings, which is still hardly been clarified. 
This embarrassment derives from the contradiction that the empirical or simulated materials employed to 
show the emergence of Zipf's law are often finalized and static specimens, while the Heaps' law actually 
describes the evolving characteristics. 

In this article, we have identified the relation between the Zipf's law and the Heaps' law from the 
perspective of coevolution between the scalings and large-scale spatial epidemic spreading. We illustrate 
the temporal evolution of the scalings: the Zipf's law and the Heaps' law are naturally shaped to coexist 
at the early stage of the epidemic at both the global and the U.S. levels, while the crossover comes with 
the emergence of their inconsistency at a larger time before reaching a stable state, where the Heaps' law 
still exists with the disappearance of strict Zipf's law. 

With the U.S. domestic air transportation and demographic data, we construct a metapopulation 
model at the U.S. level. The simulation results predict main empirical findings. Employing information 
entropy characterizing the epidemic spreading pattern, we recognize that the broad heterogeneity of the 
infrastructure plays an essential role in the evolution of scaling emergence. These findings are quite 
different from the previous conclusions in the literature. For example, studying a phenomenologically 
self-adaptive complete network, Han et al. claimed that scaling properties are dependent on the intensity 
of containment strategics implemented to restrict the interregional travel [31]. In [36], Picoli Junior et 
al. considered a simple stochastic model based on the multiplicative process [23], and suggested that 
seasonality and weather conditions, i.e., temperature and relative humidity, also dominates the temporal 
evolution of scalings because they affect the dynamics of influenza transmission. In this work, without 
the help of any specific additional factor, we directly show that the evolution of scaling emergence is 
mainly determined by the contact process underlying disease transmission on an infrastructure with huge 
volume and heterogeneous structure of population flows among different geographic regions. (The effects 
of the travel-related containment strategies implemented in real world can be neglected, since the number 
of scheduled domestic and international passengers of the U.S. air transportation only declined in 2009 
by 5.3% from 2008[86]. In fact, the travel restrictions would not be able to significantly slow down the 
epidemic spread unless more than 90% of the flight volume is reduced[58,66,69,70,88].) 

In summary, our study suggests that the analysis of large-scale spatial epidemic spread as a promising 
new perspective to understand the temporal evolution of the scalings. The unprecedented amount of 
information encoded in the empirical data of pandemic spreading provides us a rich environment to 
unveil the intrinsic mechanisms of scaling emergence. The heterogeneity of epidemic spread uncovered 
by the metapopulation model indicates the significance of performing targeted containment strategies, 
e.g. vaccination of prior groups, targeted antiviral prophylaxis, at the early time of a pandemic disease. 

Materials and Methods 
Data Description 

In this article, in order to construct the U.S. domestic air transportation network, we mainly utilize the 
"Air Carrier Traffic and Capacity Data by On-Flight Market report (December 2009)" provided by the 
Bureau of Transportation Statistics(BTS) databasc[76]. This report contains 12 months' data covering 
more than 96% of the entire U.S. domestic air traffic in 2009, and provides the monthly number of 
passengers, freight and/or mail transported between any two airports located within the U.S. boundaries 
and territories, regardless of the number of stops between them. This BTS report provides a more 
accurate solution for studying aviation flows between any two U.S. airports than other data sources(the 
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attendance and the flight- connecting information in the OAG flight schedule data are commonly unknown, 
while the datasets adopted in [63,64,66,69] primarily consider the international passengers). In order to 
study the epidemic spread in the Continental United States where we have a good probability to select 
citizens living and moving in the mainland, we get rid of the airports as well as the corresponding flight 
courses located in Hawaii, and all offshore U.S. territories and possessions from the BTS report. 

In order to obtain the U.S. demographic data, we resort to the "OMB Bulletin NO. 10-02: Update 
of Statistical Area Definitions and Guidance on Their Uses" [75] provided by the United States Office 
of Management and Budgct(OMB), and the 11 Annual Estimates of the Population of Metropolitan and 
Micropolitan Statistical Areas: April 1, 2000 to July 1, 2009" [77] provided by the United States Census 
Bureau(CB). OMB defines a Metropolitan Statistical Area(MS A) (Micropolitan Statistical Area, /iSA) 
as one or more adjacent counties or county equivalents that have at least one urban core area of at 
least 50,000 population(10,000 population but less than 50,000), plus adjacent territory that has a high 
degree of social and economic integration with the core. For other regions with at least 5,000 population 
but less than 10,000, we use the American FactFindcr[78] provided by the CB to get the demographic 
information. We do not consider sparsely populated areas with population less than 5,000, because they 
are commonly remote islands, e.g. Block Island in Rhode Island, Sand Point in Alaska. 

Before constructing the metapopulation model, we take into account the fact that there might be 
more than one airport in some huge metropolitan areas. For instance, New York-Northern New Jersey- 
Long Island(NY-NJ-PA MSA) has up to six airports (their IATA codes: JFK, LGA, ISP, EWR, HPN, 
FRG), Los Angeles-Long Beach-Santa Ana(CA MSA) has four airports (their IATA codes: LAX, LGB, 
SNA, BUR), and Chicago- Joliet-Naperville(IL- IN- WI MSA) has two airports (their IATA codes: MDW, 
ORD). Assuming a homogeneous mixing inside each subpopulation, we need to assemble each group of 
airports serving the same MSA//iSA, because the mixing within each given census areas is quite high and 
cannot be characterized by fine-grained version of subpopulations for every single airport. We searched 
for groups of airports located close to each other and belonged to the same metropolitan areas, and then 
manually aggregated the airports of the same group in a single "super-hub" . 

The full list of updates of the pandemic A(H1N1) human cases of different countries is available on 
the website of Global Alert and Response(GAR) of World Health Organization(WHO)(WHO website. 
http://www.who.int/csr/disease/swineflu/updates/en/index.html. Accessed 2011 May 24). It is worth 
remarking that WHO was no longer updating the number of the cumulated confirmed cases for each 
country after July 6th, 2009, but changed to report the number of confirmed cases on the WHO Region 
level(the Member States of the World Health Organization(WHO) are grouped into six regions, including 
WHO African Region(46 countries), WHO European Rcgion(53 countries), WHO Eastern Mediterranean 
Region(21 countries), WHO Region of the Americas(35 countries), WHO South-East Asia Region (11 
countries), WHO Western Pacific Region(27 countries) . (WHO website. 
http://www.who.int/about/regions/en/index.html. Accessed 2011 May 24). 

The cumulative number of the laboratory confirmed human cases of A(H1N1) flu infection of each 
U.S. state is available at the website of 2009 A(H1N1) Flu of the Centers for Disease Control and Preven- 
tion(CDC)(CDC website, http://cdc.gov/hlnlflu/updates/. Accessed 2011 May 24), where the detailed 
data were started from April 23, 2009, to July 24, 2009. After July 24, the CDC discontinued the report- 
ing of individual confirmed cases of A(H1N1), and began to report the total number of hospitalizations 
and deaths weekly. 

The data of the human cases of global SARS and global Avian influenza(H5Nl) are available at the 
website of the Disease covered by GAR of WHO (WHO website, http://www.who.int/csr/disease/en/. 
Accessed 2011 May 24). 
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Tables 

Table 1. The empirical results of the parameters 8 and rj, and their relevance at the early 
time(the period between April 30th and June 1st, 2009), using 2009 Pandemic A(H1N1) 
data collected by the WHO. 
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Table 2. The value of the parameters 6' us and r)' us for the simulation results at the early 
time of the period between t = 26 and t = 39. 
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Simulation Model Design 

As a basic modeling scheme, we use the metapopulation approach, which explicitly considers the 
geographical structure in the model by introducing multiple subpopulations coupled by individuals' mo- 
bility. More specifically, the subpopulations correspond to the metropolitan areas, and the dynamics of 
individuals' mobility is described by enplaning between any two regions, 
(i) Infection Dynamics in a Single Subpopulation. 

The infection dynamics takes place within each single subpopulation, and is described by a homo- 
geneously mixed population with an influenza-like illness compartmcntalization in which each individual 
exists in just one of the following discrete classes such as susccptible(S), latent(L), infectious(I) or per- 
manently recovered(R). In each subpopulation j, the population is Nj, and V^\t) is the number of 

individuals in the class [ip] at time t. By definition, it is evident that Nj = ^ T>j (t). Two essential 
kinds of the disease evolution processes are considered in the infection dynamics: the contagion pro- 
cess(e.g., a susceptible individual acquires the infection from any given infectious individual and becomes 
latent with the rate /3, where /3 is the transmission rate of a disease) and the spontaneous transition 
of individual from one compartment to anothcr(i.e. latent ones become infectious with a probability e, 
or the infectious individuals recover with a probability [i, where e _1 and are the average latency 
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time and the average infection duration, respectively). Schematically, the stochastic infection dynamics 
is given by 

f {Vf-l, Vf ] + 1 , Vf ,vf ] ),with rate (3V [ f ] vf /N s , 
{Vf,vf\vf,vf ] ) => {Vf,vf - l,vf + l,vf), with rate eVf\ 

[ (V l f ] , Vf ] , Vf - 1, Vf ] + 1), with rate fxV [ - ] , 

where the first reaction reflects the fact that each susceptible in subpopulation j would be infected by 
contacting any infectious individuals with probability /3T>f /Nj, therefore the number of new infections 
generated in subpopulation j at time t + 1 is extracted from a binomial distribution with the proba- 
bility /3T>f(t)/Nj(t) and the number of trials T>f\t); the second and the third reactions represent the 
spontaneous transition process. 

(iiQEpidemic Transmission among Different Subpopulations. 

As individuals travel around the country, the disease may spread from one area to another. There- 
fore, in addition to the infection dynamics taking place inside each subpopulation, the epidemic spreading 
at a large geographical scale is inevitably governed by the human mobility among different subpopula- 
tions by means of the domestic air transportation. Since the BTS report reflects the actual aviation flows 
between any two U.S. airports, we define a stochastic dispersal operator V j ({V^}) representing the net 
balance of individuals in a given compartment T>^ that entered in and left from each subpopulation j. 
In each subpopulation j, the dispersal operator is expressed as 



where Xjt (T>^ ) describes the daily number of individuals in the compartment [ip] traveling from subpop- 
ulation j to subpopulation I. In the scenario of air travel, this operator is relative to the passenger traffic 
flows and the population. Neglecting multiple legs travels and assuming the well mixing of population 
inside each subpopulation, we deduce that the probability of any individual traveling on each connection 
j — > £ everyday is given by pji = Wji/Nj, where ujji represents the daily passenger number from j to I. 

The stochastic variables Xji(T>j) therefore follow the multinomial distribution 



where — Xji) indicates daily number of non-traveling individuals of compartment [tp] staying 

in subpopulation j. It is noticeable that the population Nj of each subpopulation keeps constant, e.g., 
Vj(X> M ) = 0, because the passenger flows are balanced on each pair of connections in this article. 



Pandemic phases denned by the WHO 

1. Interpandemic period 

Phase 1: no new influenza virus subtypes circulating among animals have been reported to cause 
infections in humans. 

Phase 2: a new influenza virus subtypes circulating among domesticated or wild animals is known 
to have caused infection in humans, and is therefore considered a potential pandemic threat. 
2. Pandemic alert period 

Phase 3: a new influenza virus subtypes has caused sporadic cases or small clusters of disease 
in people, but has not resulted in human-to-human transmission sufficient to sustain community-level 
outbreaks. 
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Phase 4: it is characterized by verified human-to-human transmission of a new influenza virus 
subtypes able to cause "community-level outbreaks" . Though the virus is not well adapted to humans, 
the ability to cause sustained disease outbreaks in a community marks a significant upwards shift in the 
risk for a pandemic. 

Phase 5: it is characterized by human-to- human spread of the virus into at least two countries in 
one WHO region. While most countries will not be affected at this stage, the declaration of Phase 5 is a 
strong signal that a pandemic is imminent. 

3. Pandemic period 

Phase 6: it is characterized by community level outbreaks in at least one other country in a different 
WHO region in addition to the criteria defined in Phase 5. This phase indicates that a global pandemic 
is under way. 

Power Law Distribution with an Exponential Cutoff 

In fact, little real systems do display a perfect power law pattern for the Zipf's distribution or 
probability density distribution[l,2]. When an exponential cutoff is added to the power law function, 
the fit is substantially improved for dozens of systems, e.g., the forest fires, earthquakes, web hits, email 
networks, sexual contact networks. The cutoff indicates that there is a characteristic scale, and that 
infrequently super-enormous events are exponentially rare. Strictly speaking, a cutoff power law should 
always fit the data at least as good as a pure one(just let the cutoff scale go to infinity), thus the power 
law distribution can be deemed as a subset of the exponentially cutoff power law[2]. 

The Mass Action Principle 

Prof. Hamcr postulated that the course of an epidemic depends on the rate of contact between 
susceptible and infectious individuals. This conception plays a significant role in the study of epidemi- 
ology; it is the so-called 'mass action principle' in which the net rate of spread of infection is assumed 
to be proportional to the product of the density of susceptible persons times the density of infectious 
individuals [3,4]. 

1. Newman ME J (2005) Power laws, Pareto distributions and Zipf's law. Contemporary Physics 46: 
323-351. 

2. Clauset A, Shalizi CR, and Newman MEJ (2009) Power-law distributions in empirical data. SIAM 
Review 51, 661-703. 

3. Anderson RM and May RM (1991) Infectious Diseases of Humans: Dynamics and Control (Oxford 
Unvi. Press, Oxford). 

4. Hamer WH (1906) The Milroy Lectures On Epidemic disease in England — The evidence of 
variability and of persistency of type. The Lancet 167: 733-739. 
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Figure 1. The empirical results of A(H1N1). (A) The Zipf-plots of the normalized 
probability- rank distributions Pt(r) of the cumulated confirmed number of every infected 
country at several given date sampled about every two weeks, data provided by the 
WHO. (B) The Zipf-plots of P" s (r) at several given data sampled about every two weeks, 
data provided by the CDC. (C) Temporal evolution of the estimated exponent 9 of the 
normalized distribution Pt(r). (D) Temporal evolution of the estimated exponent 9 US of 
the normalized distribution P" s (r) of the period after May 15th. (E) The sublinear 
relation between the number of infected countries M(t) and the cumulative number of 
global confirmed cases Cr(t), data collected by the WHO. (F) The sublinear relation 
between the number of infected states M us (t) and the cumulative number of national 
confirmed cases C^ s (t), data collected by the CDC. The shaded areas in the figures 
(C,E,F) corresponds to their different evolution stages, respectively. 



19 




10° 10 1 10 2 10 3 

Rank of S 




1CT U — — — I 

10° 10 1 10 2 10 3 

Rank of N 



Figure 2. The heterogeneity of the USDAN's infrastructure. (A) The degree distribution 
P{k) follows a power law pattern on almost two decades with an exponent 1.30±0.03. (B) 
shows that the probability-rank distribution of the traffic outflux Sj = Y^tev^tfi where v 
denotes the set of neighbors belonging to the vertex j and the weight uijt of a connection 
between two vertices is the number of passengers traveling a given route per day, is 

skewed and heterogeneously distributed. (C) shows that the probability-rank distribution 
of populations is skewed and heterogeneously distributed. 
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Figure 3. Comparisons of the scaling properties between the UNI scenarios and the 
baseline cases. (A,C) present the comparison of the PRD P/ us (r) of the CCN of every 
infected MSA/^lSA between the baselines and the UNI scenarios at several given date 
sampled about every 30 days when Rq = 1.75, Rq = 2.3, respectively. (B,D) present the 
comparison of the information entropy profiles between the baselines and the UNI results 
when Rq = 1.75, Rq = 2.3, respectively. Each data in these figures are the median results 
over all runs that led to an outbreak at the U.S. level in 100 random Monte Carlo 
realizations. 



21 




20 40 60 80^ 100 120 140 



^10 1 , 



""I I I 1 

:(B) 




i i 


- 








/=110 . 








0.379±0.004 


-. oa( ((((((( ) 
i i ' 




i i 


i i i 



10° 10 1 10 2 10 3 10 4 

c. 



10 5 10 6 10 7 10 8 10 £ 

- s (t) 



Figure 4. The statistical results of the scaling properties of our metapopulation model. 
(A) Temporal evolution of the estimated exponent 0' us of the normalized distribution 
P[ us (r). The inset shows the growing of the number of infected subpopulations M' us {t) with 
time t. (B) The relation between the number of infected subpopulations M' us (t) and the 
national cumulative confirmed cases C^ s (t). The shaped areas in the figures corresponds 
to their different evolution stages, respectively. Each data in these figures are the median 
results over all runs that led to an outbreak at the U.S. level in 100 random Monte Carlo 
realizations. 
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Figure SI. The temporal evolution of the estimated parameter r c , data provided by the 
WHO. 
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Figure S2. The temporal evolution of the estimated exponent 8 US for all data provided by 
the CDC. 
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Figure S3. The empirical results of the SARS and avian influenza(H5Nl). (A) shows the 
normalized probability-rank distribution of the cumulated confirmed number of every 
infected country around the world at several given date sampled about every four weeks, 
data provided by the WHO (http://www.who.int/csr/sars/country/en/index.html). (B) 
shows the normalized probability-rank distribution of the cumulated confirmed number of 
every infected country around the world at several given date sampled about every half a 
year, data provided by the WHO (http://www.who.int/csr/disease/avianjnfluenza 
/country /en/). 



